Web Page Cleaning for Web Mining through Feature Weighting

Author

  • Lan Yi
Abstract

Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web mining, e.g., clustering and classification. In this paper, we propose a novel feature weighting technique to deal with Web page noise to enhance Web mining. This method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of Web pages. It then uses an information-based measure to evaluate the importance of each node in the compressed structure tree. Based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. The resulting weights are used in Web mining. We evaluated the proposed technique with two Web mining tasks, Web page clustering and Web page classification. Experimental results show that our weighting method is able to dramatically improve the mining results.
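
The pipeline outlined in the abstract (group comparable blocks across pages, score each block, weight the words inside it) can be illustrated with a rough sketch. This is not the paper's algorithm: the abstract does not define the compressed structure tree or the importance measure, so the code below assumes each page is already split into blocks keyed by a hypothetical structural path (such as "nav" or "main"), and it swaps in a simple normalized-entropy score as the importance measure. The names `block_importance` and `weighted_features` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def block_importance(pages):
    """Score each block path in [0, 1]; higher means more page-specific.

    pages: list of dicts mapping a block path (assumed, hypothetical key)
    to the list of words found in that block on one page.
    Assumption: a word spread evenly over all pages is treated as noise.
    """
    m = len(pages)
    by_path = defaultdict(list)
    for page in pages:
        for path, words in page.items():
            by_path[path].append(Counter(words))

    importance = {}
    for path, counters in by_path.items():
        vocab = set().union(*counters)
        if m < 2 or not vocab:
            # Nothing to compare against, so keep full weight.
            importance[path] = 1.0
            continue
        entropies = []
        for word in vocab:
            counts = [c[word] for c in counters]
            total = sum(counts)
            # Entropy of the word's distribution over pages, base m,
            # so a perfectly uniform spread over m pages gives 1.
            h = -sum((n / total) * math.log(n / total, m)
                     for n in counts if n)
            entropies.append(h)
        importance[path] = 1.0 - sum(entropies) / len(entropies)
    return importance


def weighted_features(page, importance):
    """Sum, per word, the importance of every block it occurs in."""
    weights = Counter()
    for path, words in page.items():
        for word in words:
            weights[word] += importance.get(path, 1.0)
    return dict(weights)


pages = [
    {"nav": ["home", "about"], "main": ["python", "tutorial"]},
    {"nav": ["home", "about"], "main": ["stock", "market"]},
    {"nav": ["home", "about"], "main": ["football", "score"]},
]
importance = block_importance(pages)
features = weighted_features(pages[0], importance)
```

In this toy input the "nav" block is identical on every page, so its words receive a weight near zero, while the words in "main" keep a weight near one. The resulting dictionary could then replace raw term counts as the feature vector passed to a clustering or classification routine.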

Similar resources

Advanced Techniques in Web Data Pre-processing and Cleaning

Central to successful e-business is the construction of web sites that attract users, capture user preferences, and entice them into making a purchase. Web mining is diverse data mining applied to categorize both the content and structure of web sites with the goal of aiding e-business. Web mining requires knowledge of the web site structure (hyperlink graph), the web content (vector model) and...

Automatic Image Annotation by Mining the Web

Automatic image annotation has become an attractive research subject. Most current image annotation methods are based on training techniques. The major weaknesses of such solutions include limited annotation vocabulary and labor-intensive involvement. However, Web images come with a lot of text, and rich annotation of samples is provided. Therefore, this report provides a novel image anno...

Eliminating Noisy Information in Web Pages using featured DOM tree

Exact information retrieval from the Web is now a great challenge for researchers, who must devise new methodologies for web mining. Due to the massive information on the Web, the size and number appear to be growing rapidly at an exponential rate which is often accompanied by a large amount of noise such as banner advertisements, navigation bars, copyright notices, etc. Although such informat...

On learning to predict Web traffic

The ease of collecting data about customers through the Internet has facilitated the process of developing large repositories of data. These data can and do contain patterns that are useful for the decision maker. Knowledge discovery and data mining methods have been widely used to extract these patterns. It is acknowledged that about 80% of the resources in a majority of data mining applicatio...

Publication date: 2003